Many R packages help you make better plots and tables. Illustration by Allison Horst.
Updates and reminders
- Homework and peer reviews are graded for completion
- Office hours are now Mon 4-5pm (Zoom and my office) and Fri 2-3pm (Zoom only)
Outline for today
- Overview of terminology and different data types
- Better plots
- Better tables
Terminology vocab list
Here are some terms we’ve been using so far.
R/coding Terminology:
- Function
- Argument (input)
- Index
- Object
- Operator (extract operator `$` or multiplication operator `*`)
- Operation
- Assignment
- Call (an object or function)
- Object name
- Filename vs. title
- Code comments
- Command line
- Base R
- Environment
- Package: bundle of code/functions and data
- Installing versus loading packages
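To see several of these terms in one place, here is a minimal sketch. The object names (heights, people, and so on) and the values are made up for illustration and are not part of the course data.

```r
# Assignment: `<-` stores a value in an object; `heights` is the object name.
heights <- c(150, 162, 171)

# Function call: mean() is the function and `heights` is its argument (input).
avg_height <- mean(heights)

# Index: square brackets pick out an element by position.
second_height <- heights[2]

# Operators: `$` extracts a column from a data frame, `*` multiplies.
people <- data.frame(name = c("Ana", "Bo", "Cy"), height_cm = heights)
heights_m <- people$height_cm * 0.01

# Installing happens once per computer; loading happens in every new R session.
# install.packages("gapminder")  # commented out so it does not reinstall on every run
# library(gapminder)
```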
Quarto/Markdown terminology:
- Markdown language
- YAML
- YAML header
- Code chunks
- Chunk options
- Inline R code
Data types
The main data types we have seen are numeric (num), integer (int), character (chr), and logical (logi). We’ll also talk a bit about a fifth type, factor.
You can use str() (and often class() and typeof()) to see the data type of a particular object or variable.
Numeric and integer refer to numbers. Integers are whole numbers only, whereas a numeric-type object can have decimal values.
Character-type objects are values in quotes and are treated like text.
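As a quick sketch of how these checks look in practice (the object names below are made up), note that typeof() reports "double" for a numeric value, which is how R stores numbers with decimals:

```r
# One made-up example value of each main data type
x_num <- 1.5    # numeric
x_int <- 2L     # integer (the L suffix asks for an integer)
x_chr <- "2"    # character (in quotes, so treated like text)
x_lgl <- TRUE   # logical

class(x_num)    # "numeric"
typeof(x_num)   # "double"
class(x_int)    # "integer"
class(x_chr)    # "character"
class(x_lgl)    # "logical"

# str() shows the abbreviated type together with the value
str(x_int)
```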
You can change data types using as.integer(), as.logical(), and so on.
identical(1.3, as.numeric("1.3"))
[1] TRUE
identical(1L, as.integer("1"))
[1] TRUE
identical("1", as.character(1))
[1] TRUE
identical(TRUE, as.logical(1))
[1] TRUE
The data type matters when you try to do certain operations:
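1 + 2
"1" + "2"                          # error: non-numeric argument to binary operator
as.character(1) + as.character(2)  # same error, because both values are now text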
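If the numbers arrive as text, one possible fix is to convert them before doing the arithmetic. A minimal sketch, with made-up object names:

```r
# Two values stored as character (text), not as numbers
a <- "1"
b <- "2"

is.character(a)  # TRUE

# Convert each value to numeric first, then add
total <- as.numeric(a) + as.numeric(b)
total            # 3
```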
In R, factors are useful for working with categorical variables (variables that have a fixed, known set of possible values) such as country and continent in the gapminder dataset. They are also handy if you want to display character vectors in a non-alphabetical order. We’ll investigate this further in the homework.
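Here is a small sketch of the ordering idea, using a made-up character vector rather than gapminder. By default factor() sorts the levels alphabetically, and the levels argument lets you choose a different order:

```r
# A made-up character vector of categories
sizes <- c("medium", "small", "large", "small")

# Default: levels are sorted alphabetically
sizes_default <- factor(sizes)
levels(sizes_default)

# Custom: list the levels in the order you want them displayed
sizes_ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_ordered)

# Summaries (and plots) follow the level order, not the alphabet
table(sizes_ordered)
```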
Object types
- Vectors and their elements
- Dataframes
- Also lists, matrices, arrays, and more
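A minimal sketch of what a few of these object types look like in code. All names and values here are made up for illustration:

```r
# Vector: a sequence of elements that all have the same data type
scores <- c(10, 20, 30)
scores[2]   # a single element, picked out by its index

# Data frame: a table whose columns are vectors of equal length
results <- data.frame(id = 1:3, score = scores)

# List: can hold objects of different types and sizes
bundle <- list(numbers = scores, label = "scores", table = results)

# Matrix: a two-dimensional grid of values of one data type
grid_values <- matrix(1:6, nrow = 2)

# str() summarizes the structure of any of these objects
str(bundle)
```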
Building a nicer plot step by step
Let’s build up a more complex plot step by step to learn how to do a bunch of other things. We’ll start with a line plot of life expectancy over time by country. (Why is group = country necessary?) Each time we add something to the plot, I’m just going to copy and paste the previous code into a new code chunk, then make one or more changes.
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line()
Now let’s split each continent into a different subplot. Is group = country still necessary? Why/why not?
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2)
Let’s change the theme:
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  theme_minimal()
Let’s fix the axis labels and add a title and subtitle. You can do this with separate functions like xlab() and ggtitle() or with a single function like labs(). We’ve seen the separate functions before, so let’s try labs() this time:
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007",
       subtitle = "By continent and country") +
  theme_minimal()
Notice that this time I added the last command (labs()) before a previous command (theme_minimal()), not at the end. For the commands I have here (labs() and theme_minimal()), the order doesn’t matter, but I tend to try to put geometry layers (geom_xx()) toward the beginning of the block of plot code and theme settings at the end of it.
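For comparison with labs(), here is a sketch of the same labels set with the separate functions xlab(), ylab(), and ggtitle(). The object names base_plot, p_labs, and p_separate are made up for this example, and it assumes the ggplot2 and gapminder packages are installed.

```r
library(ggplot2)
library(gapminder)

# Shared starting point for both versions
base_plot <- ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2)

# Version 1: one call to labs()
p_labs <- base_plot +
  labs(x = "", y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007",
       subtitle = "By continent and country") +
  theme_minimal()

# Version 2: separate functions for each axis label and the title
p_separate <- base_plot +
  xlab("") +
  ylab("Life Expectancy (Years)") +
  ggtitle("Life Expectancy, 1952-2007", subtitle = "By continent and country") +
  theme_minimal()

p_separate
```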
The x-axis is cluttered and difficult to read, so let’s fix that. We can start by having fewer axis tick labels:
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007",
       subtitle = "By continent and country") +
  scale_x_continuous(breaks = seq(1950, 2010, 20)) +
  theme_minimal()
Hmm, it doesn’t show a label at the right edge of the plot. If we want one there, we can set the axis limits explicitly so that the last break (2010) falls inside the plotted range:
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007",
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  theme_minimal()
Now the ends of the x-axes run together, so we can fix that by increasing the space or padding between the subplots. I didn’t remember how to do this, so I googled “r ggplot increase space between panels” to figure it out.
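One way to add that padding is the panel.spacing setting inside theme(), which is the same setting the later plots in these notes use. This is a sketch that assumes ggplot2 and gapminder are installed; the name p_spaced is made up for the example.

```r
library(ggplot2)
library(gapminder)

p_spaced <- ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007",
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  theme_minimal() +
  # Add extra padding between the facet panels, measured in lines of text
  theme(panel.spacing = unit(2, "lines"))

p_spaced
```

The theme() call comes after theme_minimal() on purpose: a complete theme like theme_minimal() resets theme settings, so adding it last would undo the spacing change.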
Let’s color the lines by continent. Notice this automatically generates a legend.
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007",
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        axis.text.x = element_text(angle = 30))
We can move the legend:
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007",
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        legend.position = c(0.8, 0.2)) # or "bottom" or many other options
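A note on versions: starting with ggplot2 3.5.0, passing numeric coordinates to legend.position triggers a deprecation warning, and the coordinates go in legend.position.inside instead. The sketch below picks whichever form matches the installed version. The names p_base and p_legend are made up for the example, and it assumes ggplot2 and gapminder are installed.

```r
library(ggplot2)
library(gapminder)

p_base <- ggplot(data = gapminder,
                 aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  theme_minimal()

if (packageVersion("ggplot2") >= "3.5.0") {
  # Newer ggplot2: say the legend goes inside, then give the coordinates separately
  p_legend <- p_base +
    theme(legend.position = "inside",
          legend.position.inside = c(0.8, 0.2))
} else {
  # Older ggplot2: numeric coordinates go straight into legend.position
  p_legend <- p_base +
    theme(legend.position = c(0.8, 0.2))
}

p_legend
```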
If you want to change the colors or the color palette, you can do that using, for example, scale_color_manual(). We can also use that same function to rename the legend.
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007",
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  scale_color_manual(name = "Which continent are\nwe looking at?", # \n adds a line break
                     values = c("Africa" = "#4e79a7", "Americas" = "#f28e2c", "Asia" = "#e15759",
                                "Europe" = "#76b7b2", "Oceania" = "#59a14f")) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        legend.position = c(0.8, 0.2)) # or "bottom" or many other options
Let’s add a line to each plot that represents the continent average (a smoothed trend fit across all of that continent’s countries):
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  geom_line(stat = "smooth", aes(group = continent)) +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007",
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  scale_color_manual(name = "Which continent are\nwe looking at?", # \n adds a line break
                     values = c("Africa" = "#4e79a7", "Americas" = "#f28e2c", "Asia" = "#e15759",
                                "Europe" = "#76b7b2", "Oceania" = "#59a14f")) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        legend.position = c(0.8, 0.2)) # or "bottom" or many other options
Oops, we can’t see the line! We should make it stand out as different from the other country-specific lines. Let’s make it thicker and black:
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  geom_line(stat = "smooth", aes(group = continent),
            color = "black", size = 1) +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007",
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  scale_color_manual(name = "Which continent are\nwe looking at?", # \n adds a line break
                     values = c("Africa" = "#4e79a7", "Americas" = "#f28e2c", "Asia" = "#e15759",
                                "Europe" = "#76b7b2", "Oceania" = "#59a14f")) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        legend.position = c(0.8, 0.2)) # or "bottom" or many other options
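This probably gives you a warning like this one:
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.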
What does it mean that the size aesthetic was “deprecated”? It means that size used to be the argument for this, and technically you can still use it, but the ggplot2 developers are phasing it out. They would rather you use something else instead, and the warning tells you that the new argument to use is linewidth. So let’s change that:
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  geom_line(stat = "smooth", aes(group = continent),
            color = "black", linewidth = 1) +
  facet_wrap(~ continent, nrow = 2) +
  labs(x = "", y = "Life Expectancy (Years)",
       title = "Life Expectancy, 1952-2007",
       subtitle = "By continent and country") +
  scale_x_continuous(limits = c(1950, 2010), breaks = seq(1950, 2010, 20)) +
  scale_color_manual(name = "Which continent are\nwe looking at?", # \n adds a line break
                     values = c("Africa" = "#4e79a7", "Americas" = "#f28e2c", "Asia" = "#e15759",
                                "Europe" = "#76b7b2", "Oceania" = "#59a14f")) +
  theme_minimal() +
  theme(panel.spacing = unit(2, "lines"),
        legend.position = c(0.8, 0.2)) # or "bottom" or many other options
Your Turn
Let’s keep going!
- Make the continent average line a little transparent so that you can see the other lines through it.
- Rename the legend “Continent”.
- Put all the continents in a single row instead of having this two-row plot arrangement.
- Change the font size.
- Set the panel.spacing using a different unit than lines.